REMARKS 

Claims 1-38 were pending in the application. Claims 1, 4, 5, 6, 7, and 15 have been 
amended, claims 2, 3, 8, 9, and 17-39 have been canceled, and new claims 40-44 have 
been added. Accordingly, upon entry of this amendment, claims 1, 4-7, 10-16, and 40-44 
will remain pending. 

Support for the amendments to the claims and the new claims may be found 
throughout the specification, including the originally filed claims. In particular, support 
for the amendments to claims 5 and 6 and new claim 42 may be found in Applicants' 
specification at, for example, page 9, lines 24-26 and in Table 1. 

No new matter has been added. Any amendment to the claims was made solely 
to more particularly point out and distinctly claim the subject matter of Applicants' 
invention in order to expedite the prosecution of the application. Applicants reserve the 
right to pursue the claims as originally filed in this or a separate application(s). 

Objection to Claim 2 

The Examiner has objected to claim 2 under 37 C.F.R. 1.75 "as being a substantial 
duplicate of claim 1." In particular, the Examiner is of the opinion that "[w]hen two claims in 
an application are duplicates or else are so close in content that they both cover the same thing, 
despite a slight difference in wording, it is proper after allowing one claim to object to the 
other as being a substantial duplicate of the allowed claim." See MPEP § 706.03(k). 

Applicants respectfully traverse the foregoing objection to claim 2. However, in an 
effort to expedite prosecution of the instant application, and in no way acquiescing to the 
Examiner's objection, claim 2 has been canceled, thereby rendering the foregoing objection 
moot. 

Rejection of Claims 1-17 Under 35 U.S.C. §101 

The Examiner has rejected claims 1-17 under 35 U.S.C. §101 because, according to 
the Examiner, "the claimed invention is not supported by either a credible asserted utility or a 
well established utility." In particular, the Examiner is of the opinion that "[t]he 
specification does not show any enzyme assays to demonstrate that the encoded protein has a 
diaminopimelate epimerase activity." Furthermore, the Examiner is of the opinion that 

[t]he state of the art as exemplified by Attwood et al. (Comput. Chem. 
2001, Vol. 25(4), pp. 329-39) is such that "...we do not fully 
understand the rules of protein folding, so we cannot predict protein 
structure; and we cannot invariably diagnose protein function, given 
knowledge only of its sequence or structure in isolation" (see Abstract 
and entire publication). Furthermore, Ponting (Brief. Bioinform. 
March 2001, Vol. 2(1), pp. 19-29) states that "...predicting function 
by homology is a qualitative, rather than quantitative, process and 
requires particular care to be taken... due attention should be paid to all 
available clues to function, including orthologue identification, 
conservation of particular residue types, and the co-occurrence of 
domains in proteins" (See Abstract and entire publication). 
The specification does not specifically disclose the specific function 
of the protein of SEQ ID NO: 2. It appears that the main utility of the 
nucleic acid and protein is to carry out further research to identify the 
biological function and substantial utilities associated with the 
protein. 

Applicants respectfully traverse the foregoing rejection, and assert that a specific, 
substantial and well-established utility, which would have been credible to one skilled in 
the art at the time of the invention, is clearly disclosed in the instant specification. 
Applicants have described the chemical, physical and biological properties, as well as the 
functional characteristics of several "metabolic pathway (MP)" molecules, including the 
claimed sequences which encode diaminopimelate epimerase polypeptides, in the instant 
specification in detail, at least, for example, at page 9, lines 24-35, page 16, line 7, 
through page 18, line 27, page 42, line 1 through page 45, line 34, and in Table 1. At 
page 22, lines 30-34 of Applicants' specification, it is stated that "'the function of an MP 
protein' contributes to the overall functioning of one or more such metabolic pathway 
and contributes, either directly or indirectly, to the yield, production, and/or efficiency of 
production of one or more fine chemicals. Examples of MP protein activities are set forth 
in Table 1." 

By performing, at least, structural and sequence analysis, Applicants have 
determined that these diaminopimelate epimerase polypeptides have significant structural 
homology to a known class of diaminopimelate epimerase proteins having a well- 
established function in the yield, production, and/or efficiency of production of one or 
more fine chemicals, including lysine, as is clearly set forth in Applicants' 
specification. 

While examples exist of polypeptide families wherein individual members have 
distinct, even opposite, biological activities, growing databases and improved search 
techniques, particularly the iterated PSI-BLAST tool, have yielded substantial 
improvement in secondary structure prediction accuracy. According to Rost, a copy of 
which is submitted herewith as Appendix A, "[s]econdary structure predictions are 
increasingly becoming the work horse for numerous methods aimed at predicting protein 
structure and function." Burkhard Rost, Review: Protein Secondary Structure Prediction 
Continues to Rise (2001) J. Structural Biology 134: 204-218. 

The Utility Guidelines indicate that evidence of structural similarity of a 
compound (e.g., a polypeptide) with a compound known to have a particular activity, is 
supportive of an assertion of a credible utility. The Utility Guidelines further state that 
"Office personnel should evaluate not only the existence of the structural relationship, but 
also the reasoning used by the applicant or a declarant to explain why that structural 
similarity is believed to be relevant to the applicant's assertion of utility" (see page 19, 
section II (B) of the Utility Guidelines). Based on the Utility Guidelines as set forth 
above, in combination with the teachings of the instant specification regarding the 
structural similarity of the molecules of the instant invention to diaminopimelate 
epimerase polypeptides, one of skill in the art would find the asserted utility of the 
claimed molecules to be credible. 

As the Examiner is aware, the applicant does not have to provide evidence 
sufficient to establish that an asserted utility is true "beyond reasonable doubt." In re 
Irons, 340 F.2d 974, 978, 144 USPQ 351, 354 (CCPA 1965). Instead, evidence will be 
sufficient if, considered as a whole, it leads a person of ordinary skill in the art to 
conclude that the asserted utility is more likely than not true. M.P.E.P. §2164.07. 
Applicants' specification contains ample teachings regarding the role and importance of 
diaminopimelate epimerase molecules in the production of fine chemicals, including 
lysine. Applicants respectfully submit that a person of ordinary skill in the art would 
conclude that Applicants' asserted utility is more likely than not true, which is all that is 
required under 35 U.S.C. §101. 

In view of the foregoing, Applicants assert that each of the utilities set forth in the 
specification for the invention as instantly claimed is a specific, credible and substantial 
and/or well-established utility that would have been recognized as such by one of skill 
in the art at the time the application was filed. Therefore, the instant claims meet the 
requirements of 35 U.S.C. §101, and Applicants respectfully request reconsideration and 
withdrawal of this rejection. 



Rejection of Claims 1-17 Under 35 U.S.C. § 112, First Paragraph 

The Examiner has rejected claims 1-17 under 35 U.S.C. §112, first paragraph. In 
particular, the Examiner is of the opinion that "since the claimed invention is not supported 
by either a credible asserted utility or a well established utility for the reasons set forth above 
in the rejection of claims 1-17 under 35 U.S.C. §101, one skilled in the art clearly would not 
know how to use the claimed invention." The Examiner is further of the opinion that 



[f]urthermore, claim 6 which encompasses any nucleic acid molecule 
comprising a nucleotide sequence which has at least 50% identity to 
SEQ ID NO:1 or complement thereof is not enabled by the 
specification. . . [t]he specification provides guidance a nucleic acid 
molecule consisting of the nucleotide sequence of SEQ ID NO: 1 or a 
nucleic acid molecule encoding a protein consisting of the amino acid 
sequence of SEQ ID NO: 2. While molecular biological techniques 
and genetic manipulation are known in the prior art and the skill of the 
artisan are well developed, knowledge regarding the nucleotides to 
change, i.e. delete, insert, substitute, and combinations thereof, to 
make a nucleic acid molecule having at least 50% identity to SEQ ID 
NO: 1 is lacking. Furthermore, knowledge regarding the biological 
function of a nucleic acid molecule having at least 50% identity to 
SEQ ID NO: 1 is lacking. Thus, searching for specific nucleotides to 
change to make the claimed nucleic acid is well outside the realm of 
routine experimentation and predictability in the art of success is 
extremely low. Also well outside the realm of experimentation is 
identifying the function and use of the nucleic acid molecule having 
at least 50% identity to SEQ ID NO: 1. The amount of 
experimentation to determine what specific nucleotides to change to 
make the claimed nucleic acid molecule and determining the 
biological function and utility of the claimed nucleic acid molecule is 
enormous. Such experimentation entails selecting specific 
nucleotides to change, i.e. delete, insert, substitute, and combinations 
thereof, to make a nucleic acid molecule having at least 50% identity 
to SEQ ID NO: 1 and determining the biological function and utility 
of the nucleic acid. Since routine experimentation in the art does not 
include such enormous experimentation, where the expectation of 
determining the biological function of a nucleic acid molecule having 
at least 50% identity to SEQ ID NO: 1 is unpredictable, the Examiner 
finds that one skilled in the art would require additional guidance, 
such as information regarding the specific nucleotides to change and 
the biological function and utility of the nucleic acid molecule. 
Without such guidance, the experimentation left to those skilled in the 
art is undue. 

Applicants have canceled claims 2, 3, 8, 9, and 17 thus rendering the instant 
rejection moot as it pertains to these claims. With respect to claims 1, 4-7, 10-16, and 
new claims 40, 41, 42, and 43, Applicants respectfully traverse the foregoing rejection and 
submit that as indicated above, the claimed invention has specific, substantial, and 
credible utilities and, thus, one of skill in the art would know how to use the claimed 
invention. 

There is sufficient written description in Applicants' specification regarding 
nucleic acid molecules comprising SEQ ID NO:1, nucleic acid molecules with a 
significant degree of homology to SEQ ID NO:1, which encode polypeptides which are 
capable of functioning as diaminopimelate epimerase polypeptides, or polypeptides 
which are capable of modulating production of a fine chemical, to inform a skilled artisan 
that Applicants were in possession of the claimed invention at the time the application 
was filed as required by section 112, first paragraph (see M.P.E.P. 2163.02). In order to 
meet the written description requirement of the first paragraph of 35 U.S.C. §112, it is not 
necessary that a patent specification describe each and every specific member of a genus 
recited in a claim. 

Furthermore, with respect to rejected claim 6 and new claims 42 and 43, a claim 
to a genus of chemical compounds satisfies the written description requirement when its 
accompanying specification either defines by sequence a representative number of its 
members falling within the scope of the genus or when its accompanying specification 
defines the structural features common to a substantial portion of the genus (The Regents 
of the University of California v. Eli Lilly and Co., 43 USPQ2d 1398, 1406 (Fed. Cir. 
1997)). For reasons discussed in detail below, the instant specification satisfies this 
requirement for the claimed invention. 

Claim 6 is directed to isolated nucleic acid molecules comprising a nucleotide 
sequence which has at least 90% identity with the nucleotide sequence of SEQ ID NO:1, 
or a complement thereof, and wherein said nucleic acid molecule encodes a polypeptide 
which is capable of functioning as a diaminopimelate epimerase. Claim 42 is directed to 
isolated nucleic acid molecules consisting of a nucleotide sequence which has at least 
90% identity with the nucleotide sequence of SEQ ID NO:1, or a complement thereof, 
and wherein said nucleic acid molecule encodes a polypeptide which is capable of 
functioning as a diaminopimelate epimerase. 

Claims 6, 42 and 43 are not directed to any and/or all polynucleotides but rather 
are directed only to those that have a high degree of identity to SEQ ID NO:1 and 
that encode functional diaminopimelate epimerase polypeptides or polypeptides 
which are capable of modulating the production of a fine chemical. 

Example 14 of the Revised Interim Written Description Guidelines Training 
Materials provides that a claim directed to variants of a polypeptide having SEQ ID 
NO:3 "that are at least 95% identical to SEQ ID NO:3 and catalyze the reaction of A->B," 
with an accompanying specification that discloses a single species falling within the 
claimed genus, satisfies the requirements of 35 U.S.C. §112, first paragraph for written 
description. The rationale behind the foregoing conclusion, as presented by the Written 
Description Guidelines, is that "[t]he single species disclosed is representative of the 
genus because all members have at least 95% structural identity with the reference 
compound and because of the presence of an assay which Applicant provided for 
identifying all of the at least 95% identical variants of SEQ ID NO:3 which are capable of 
the specified catalytic activity." 

Similarly, in the present case, claims 6 and 42 are directed to isolated nucleic acid 
molecules comprising or consisting of a nucleotide sequence that is at least 90% identical 
to the nucleotide sequence shown in SEQ ID NO:1, wherein the nucleotide sequence 
encodes a polypeptide capable of functioning as a diaminopimelate epimerase 
polypeptide or polypeptides which are capable of modulating the production of a fine 
chemical. 

Applicants have disclosed in the instant specification assays for identifying all of 
the at least 90% identical variants of SEQ ID NO:1 which encode polypeptides capable of 
functioning as a diaminopimelate epimerase polypeptide or polypeptides which are 
capable of modulating the production of a fine chemical (see, for example, page 54, line 
26 through page 55, line 20 of Applicants' specification). 

Based on the foregoing teachings in Applicants' specification and the knowledge 
generally available in the art, one skilled in the art would conclude that Applicants were 
in possession of the claimed invention at the time of filing of the application. The skilled 
artisan would also be able to make and use the claimed polynucleotides using only 
routine experimentation. 

Accordingly, based on the amendments to the claims and the comments set forth 
above, Applicants respectfully request reconsideration and withdrawal of the instant 
rejection under 35 U.S.C. §112, first paragraph. 

Rejection of Claims 2, 8, 15, and 16 Under 35 U.S.C. §112, Second Paragraph 

The Examiner has rejected claims 2, 8, 15, and 16 under 35 U.S.C. §112, second 
paragraph, "as being indefinite for failing to particularly point out and distinctly claim the 
subject matter which applicant regards as the invention." 

In particular, the Examiner is of the opinion that "[i]n claim 2, the recited 
limitations are vague and indefinite because the specific identity and function of the claimed 
proteins are not known and not recited. Furthermore, there is insufficient antecedent basis for 
the limitations cited in lines 3-4." 

Applicants respectfully traverse the foregoing rejection. However, in an effort to 
expedite prosecution of the application, claim 2 has been canceled, thereby rendering the 
foregoing rejection moot as it pertains to this claim. 

Furthermore, the Examiner is of the opinion that "[c]laim 8 is indefinite because the 
specific hybridization are not recited and one of skill in the art cannot determine the metes and 
bounds of the claimed invention." 

Applicants respectfully traverse the foregoing rejection. However, in an effort to 
expedite prosecution of the application, claim 8 has been canceled, thereby rendering the 
foregoing rejection moot. 

The Examiner is also of the opinion that "[c]laims 15 and 16 are vague and indefinite 
because the meaning of the phrase "modulation in production of a fine chemical" is not 
known and the specific identity of the fine chemical is not known and not recited in claims 15 
and 16." 

Applicants respectfully traverse the foregoing rejection. Applicants respectfully 
submit that claims 15 and 16 are clear and definite in the recitation of "fine chemical." One 
of ordinary skill in the art at the time the application was filed would understand what was 
included as a fine chemical based on the teachings of Applicants' specification and what 
was known in the art at the time of filing. Applicants' specification points out in great 
detail, at, for example, page 10, line 1 through page 16, line 5, what is included in the 
definition of a fine chemical. A fine chemical may include any fine chemical as set forth in 
claim 16, or any fine chemical as described in Applicants' specification or known in the art. 
Based on the detailed description of a fine chemical in Applicants' specification, as well as 
the interpretation of one of skill in the art at the time the application was filed, one of 
ordinary skill in the art would understand what is encompassed by the term "fine chemical." 



USSN: 09/606,740 



Group Art Unit: 1652 



Therefore, for the foregoing reasons, Applicants respectfully request reconsideration 
and withdrawal of the rejection of claims 15 and 16 under 35 U.S.C. §112, second 
paragraph. 

Rejection of Claims 7 and 8 Under 35 U.S.C. §102 

The Examiner has rejected claim 7 under 35 U.S.C. § 102(b) "as being anticipated by 
Cole et al (Accession Z98209 AL123456)." In particular, the Examiner is of the opinion 
that "Cole et al (Accession Z98209 AL123456) teach a nucleic acid molecule comprising a 
fragment of at least 15 nucleotides of the nucleotide sequence of SEQ ID NO: 1 (see attached 
alignment)." 

Applicants respectfully traverse the foregoing rejection for the following reasons. 
For a prior art reference to anticipate a claimed invention, the prior art must teach each 
and every element of the claimed invention. Lewmar Marine v. Barient, 827 F.2d 744, 3 
USPQ2d 1766 (Fed. Cir. 1987). 

Claim 7, as amended, is directed to isolated nucleic acid molecules comprising a 
fragment of at least 22 contiguous nucleotides of the nucleotide sequence of SEQ ID 
NO: 1, or the complement thereof. As set forth in the alignment provided by the 
Examiner, Cole et al (Accession Z98209 AL123456) and SEQ ID NO: 1 do not share more 
than 21 contiguous nucleic acid residues. Therefore, Cole et al (Accession Z98209 AL123456) does not 
teach each and every element of the claimed invention. Accordingly, Applicants respectfully 
request reconsideration and withdrawal of the foregoing rejection under 35 U.S.C. § 102(b). 

The Examiner has also rejected claim 8 under 35 U.S.C. § 102(b) as being 
anticipated by Smith (Accession U00019). The Examiner is of the opinion that "Smith 
(Accession U00019) teach a nucleic acid molecule which is expected to hybridize to the 
nucleic acid of SEQ ID NO: 1 since the claim does not recite the specific hybridization 
conditions (see alignment)." 

Applicants respectfully traverse the foregoing rejection. However, in an effort to 
expedite prosecution of the application, claim 8 has been canceled, thereby rendering the 
foregoing rejection moot. 

SUMMARY 

If a telephone conversation with Applicants' Attorney would expedite the 
prosecution of the above-identified application, the examiner is urged to call the 
undersigned at (617) 227-7400. 



LAHIVE & COCKFIELD, LLP 
28 State Street 
Boston, MA 02109 
Tel. (617) 227-7400 

Dated: October 20, 2003 




Lisa M. DiRocco, Esq. 
Registration No. 51,619 
Attorney for Applicants 


Methods predicting protein secondary structure 
improved substantially in the 1990s through the use 
of evolutionary information taken from the divergence 
of proteins in the same structural family. Recently, 
the evolutionary information resulting from 
improved searches and larger databases has again 
boosted prediction accuracy by more than four 
percentage points to its current height of around 76% 
of all residues predicted correctly in one of the 
three states, helix, strand, and other. The past year 
also brought successful new concepts to the field. 
These new methods may be particularly interesting 
in light of the improvements achieved through simple 
combining of existing methods. Divergent evolutionary 
profiles contain enough information not 
only to substantially improve prediction accuracy, 
but also to correctly predict long stretches of identical 
residues observed in alternative secondary 
structure states depending on nonlocal conditions. 
An example is a method automatically identifying 
structural switches and thus finding a remarkable 
connection between predicted secondary structure 
and aspects of function. Secondary structure predictions 
are increasingly becoming the work horse 
for numerous methods aimed at predicting protein 
structure and function. Is the recent increase in 
accuracy significant enough to make predictions 
even more useful? Because the recent improvement 
yields a better prediction of segments, and in particular 
of β strands, I believe the answer is affirmative. 
What is the limit of prediction accuracy? We 
shall see. © 2001 Academic Press 



INTRODUCTION 

History. Linus Pauling correctly guessed the formation 
of helices and strands (14, 16) (and falsely 
hypothesized other structures). Three years before 
Pauling's guess was verified by the publications of 
the first X-ray structures (16, 17), one group had 
already ventured to predict secondary structure 
from sequence (18). The first-generation prediction 
methods following in the 1960s and 1970s were all 
based on single amino acid propensities (19). The 
second-generation methods dominating the scene 
until the early 1990s used propensities for segments 
of 3-51 adjacent residues (19). Basically any imaginable 
theoretical algorithm had been applied to the 
problem of predicting secondary structure from sequence. 
However, it seemed that prediction accuracy 
stalled at levels slightly above 60% (percentage of 
residues predicted correctly in one of the three 
states: helix, strand, and other). The reason for this 
limit was the restriction to local information. Can 
we introduce some global information into local 
stretches of residues? 
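
The accuracy figure used throughout this review (percentage of residues predicted correctly in one of the three states) can be illustrated with a minimal sketch. The function name `q3_accuracy` and the one-letter state codes (H for helix, E for strand, L for other) are assumptions made for this illustration and are not taken from the review.

```python
def q3_accuracy(observed: str, predicted: str) -> float:
    """Return the percentage of residues whose predicted state equals the observed state.

    Each string holds one state letter per residue. The letters H (helix),
    E (strand), and L (other) are a hypothetical encoding for this sketch.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("at least one residue is required")
    # Count positions where the predicted state matches the observed state.
    matches = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * matches / len(observed)


if __name__ == "__main__":
    observed = "HHHHEEEELL"   # made-up example data
    predicted = "HHHEEEELLL"  # made-up example data
    print(q3_accuracy(observed, predicted))
```

On this made-up ten-residue example, eight positions agree, so the sketch reports 80%.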

Secondary structure prediction profits from divergence. 
Early on, Dickerson et al. (20) realized that 
information contained in multiple alignments can 
improve predictions. Zvelebil et al. (21) incorporated 
this concept into an automatic prediction method. 
However, the breakthrough of the third-generation 
methods to levels above 70% accuracy required a 
combination of larger databases with more advanced 
algorithms (19, 22). The major component of 
these new methods was the use of evolutionary information. 
All naturally evolved proteins with more 
than 35% pairwise identical residues over more than 
100 aligned residues have similar structures (23). 
This seemingly implies an amazing stability of 
structure with respect to sequence divergence. However, 
this average figure hides the fact that neutral 
mutations are extremely unlikely. Supposedly most 
mutations result in proteins that will not adopt any 
globular structure at all. In other words, only a tiny 
fraction of all possible proteins exist. Hence, 
position-specific profiles describing which residues can 
be exchanged against which others at which positions 
contain crucial information about protein 
structure. One consequence is that stretches of, say, 
17 adjacent residues implicitly contain information 
about long-range interactions and environment 
since the profile reflects evolutionary constraints. 
Using evolutionary divergence was the 
start key to the third-generation prediction methods. 
Knowing 3D structure,¹ we can identify very 
distant relationships between proteins that would 
improve accuracy even further (24). Can we build 
larger and more diverged families without knowing 
structure? 

FIG. 1. Profile-based searches extend evolutionary information. The cloud signifies a protein structural family for the query protein 
U, i.e., all proteins that have a similar 3D structure. A simple pairwise comparison of U with all other proteins covers the "safe zone" of 
sequence alignment (gray circle around U). This zone can be defined, e.g., by BLAST scores below 10^-16 or by more than 35% pairwise 
identical residues over long alignments. Assume that there are only five other proteins (small white circles) in the safe zone falling on one 
side of U. For example, PSI-BLAST starts the next iteration with the family-specific profile given by the proteins found in the safe zone. 
Searching the database again with this profile reaches safely into the twilight zone (the zone reached is marked by the double-lined egg 
in the figure). However, no current method generally reaches all members of family U. Furthermore, in particular for PSI-BLAST the new 
region may fall outside of the initial safe zone (black subregion of the safe zone). Finally, the regions that could have been reached by 
sequence-space hopping or intermediate sequence searches (dashed circles around five initial hits; (120, 121)) are not entirely covered by 
the profile-based search. The tricky bit is to avoid the possibility that the profile will pick unrelated proteins (transparent egg) and thus 
connect two separate structural families (U and X). Conclusions: (i) Iterated PSI-BLAST searches can safely identify fairly divergent 
family members. (ii) Close homologues may be lost during the extension of the family. (iii) The advanced search can lead the results astray. 
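
The safe-zone criterion quoted in this section (more than 35% pairwise identical residues over more than 100 aligned residues) can be sketched as a simple check on a gapped pairwise alignment. This is only an illustration under stated assumptions: the function names, the use of "-" as the gap character, and the choice to count identity over non-gap columns are not taken from the review.

```python
from typing import Tuple


def pairwise_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> Tuple[float, int]:
    """Return (percent identical residues, number of aligned residue pairs).

    Both inputs are rows of one pairwise alignment and must have equal length.
    Columns containing a gap in either row are skipped (an assumption of this sketch).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("alignment rows must have the same length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # not an aligned residue pair
        aligned += 1
        if a == b:
            identical += 1
    percent = 100.0 * identical / aligned if aligned else 0.0
    return percent, aligned


def in_safe_zone(aligned_a: str, aligned_b: str,
                 min_identity: float = 35.0, min_length: int = 100) -> bool:
    """Apply the thresholds quoted in the text: >35% identity over >100 aligned residues."""
    percent, aligned = pairwise_identity(aligned_a, aligned_b)
    return percent > min_identity and aligned > min_length


if __name__ == "__main__":
    # Made-up toy alignment: 6 aligned pairs, 5 identical, far too short for the safe zone.
    print(pairwise_identity("ACDE-FGH", "ACDQKFG-"))
    print(in_safe_zone("ACDE-FGH", "ACDQKFG-"))
```

Both thresholds are kept as parameters because the review also describes the safe zone in terms of BLAST scores, which this sketch does not model.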



¹ Abbreviations used: 3D structure, three-dimensional (coordinates 
of protein structure); 1D structure, one-dimensional (e.g., 
sequence or string of secondary structure); ASP, method identifying 
regions of structure ambivalent in response to global 
changes (1); DSSP, database and method converting 3D coordinates 
into secondary structure (2); HMMSTR, hidden Markov 
model-based prediction of secondary structure (3); JPred, method 
combining other prediction methods (4, 5); JPred2, divergent 
profile (PSI-BLAST)-based neural network prediction (6); PHD, 
simple profile-based neural network prediction (7); PHDpsi, divergent 
profile (PSI-BLAST)-based [...] (8); PROF, divergent profile-based 
[...] trained and tested with PSI-BLAST [...]; PSI-BLAST, [...] 
iterative specific profile-based [...] alignment 
method (10); PSIPRED, divergent profile (PSI-BLAST)-based 
neural network prediction [...]; [...] prediction, 
using hidden Markov models as input (12); SSpro, profile-based 
advanced neural network prediction method (13). 



New database searches extend family divergence. 
It was also recognized very early on that information 
from the position-specific evolutionary exchange 
profile of a particular protein family facilitates dis- 
covering more distant members of that family (20). 
Automatic database search methods successfully 
used position-specific profiles for searching (25). 
However, the breakthrough for large-scale routine 
searches was achieved with the development of PSI- 
BLAST (10) and hidden Markov models (12, 26). In 
particular, the gapped, profile-based, and iterated 
search tool PSI-BLAST continues to revolutionize 
the field of protein sequence analysis through its 
unique combination of speed and accuracy. More 
distant relationships are found through iteration 
starting from the safe zone of comparisons and intruding 
deeply and reliably into the twilight zone. 

Topics left out here. This review focuses on methods 
predicting secondary structure for globular proteins, 
in general. At the infancy of analyzing the 
proteome of entirely sequenced organisms, the most 
useful structure prediction methods are those that 
focus on particular classes of proteins, such as proteins 
containing membrane helices and coiled-coil 
regions (27-30). For predicting the topology of helical 
membrane proteins, a number of new methods 
add interesting new facets (31-36). However, no 
method has truly used the flood of recent experimental 
information about membrane proteins (37). 
Overall, membrane helices can be predicted much 
more accurately than globular helices. The current 
state of the art is to correctly predict all membrane 
helix topology for more than 80% of the proteins and 
to falsely predict membrane helices for less than 4% 
of all globular proteins. We have recently come 
across evidence suggesting that this figure overestimates 
performance (Rost, unpublished). Clearly, 
methods developed to predict helices in globular proteins 
go completely wrong for membrane helices! In 
contrast, porins appear to be predicted relatively 
accurately by methods developed for globular proteins 
(38, 39). Few methods specifically predicting 
coiled-coil regions have been published recently (older 
review in (40)). Two interesting developments are 
the prediction of the dimeric state of coiled-coils (41) 
and a method predicting 3D structure for coiled-coil 
regions (42). In fact, the latter is the only existing 
method predicting 3D structure below 2-Å main 
chain deviation over more than 30 residues. Another 
example of successful specialized secondary structure 
prediction methods is the focus on β turns (43, 
44). The method from the Thornton group appears to 
be the most accurate current means of predicting 
turns. Successful methods specialized in predicting 
α-helix propensities have resulted from the experimental 
studies of short peptides in solution (45, 46). 
Neither the turn nor the helix-in-solution methods 
have yet been combined with other secondary structure 
prediction methods. 



MORE DATA + REFINED SEARCH = BETTER PREDICTION 



Jones broke through by using PSI-BLAST searches of large
databases. David Jones pioneered the automatic use of
iterated PSI-BLAST searches (11). The most important step
achieved by the resulting method PSIPRED has been the
detailed strategy of avoiding pollution of the profile
through unrelated proteins (Fig. 1). To avoid this trap,
the database searched must be filtered first (11). At the
CASP meeting at which David Jones introduced PSIPRED,
Kevin Karplus and colleagues presented their prediction
method (SAM-T99sec), finding more diverged profiles
through hidden Markov models (47, 48). Recently, Cuff and
Barton also successfully used PSI-BLAST alignments for
JPred2 (see 49).



Jennings et al. (50) explore an alternative to increasing
divergence: they started with a safe zone alignment
through ClustalW (51) and HMMer (26) and iteratively
refined the alignment using the secondary structure
prediction from DSC (52). The resulting alignment is
reported to be more accurate and to yield higher
prediction accuracy than the initial ClustalW/HMMer
alignments (50). How accurate is secondary structure
prediction in 2000?

Prediction accuracy peaks at 76% accuracy. The current
best methods reach a level of 76% three-state per-residue
accuracy (Table I). This constitutes a sustained level
more than four percentage points above the last century's
best method not using diverged profiles (PHD in Table I).
Fortunately, the improvement is valid for helix, strand,
and nonregular regions (information and correlation
indices in Table I). Furthermore, significantly fewer
residues are confused between the states helix and strand
(BAD score, Table I). Finally, some new methods also
improve in a more global sense by improving the accuracy
of assigning the secondary structural class (all-alpha,
all-beta, alpha/beta, and other) based on the predicted
content of regular secondary structure (Class score,
Table I).
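
The per-residue measures used here can be illustrated with a minimal sketch. It is not the code used by EVA; the function names, the one-letter encoding (H helix, E strand, L other), and the normalisation of the BAD score by the number of observed helix and strand residues are assumptions made for illustration.

```python
def q3(observed: str, predicted: str) -> float:
    """Three-state per-residue accuracy in percent."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


def bad_score(observed: str, predicted: str) -> float:
    """Percentage of helix residues predicted as strand and vice versa.

    Assumption: normalised by the number of observed H and E residues.
    """
    if len(observed) != len(predicted):
        raise ValueError("need two strings of equal length")
    pairs = [(o, p) for o, p in zip(observed, predicted) if o in "HE"]
    if not pairs:
        return 0.0
    confused = sum((o, p) in {("H", "E"), ("E", "H")} for o, p in pairs)
    return 100.0 * confused / len(pairs)


# Toy example (invented strings, not real data).
obs = "LHHHHLLEEEEL"
pred = "LHHHLLLEEHEL"
print(round(q3(obs, pred), 1), bad_score(obs, pred))
```

In the toy example, 10 of 12 residues are correct and one of the eight helix or strand residues is confused with the other regular state.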

Sources of improvement: Four parts database growth, three
parts extended search, two parts other. Jones solicited
two causes for the improved accuracy: (i) training and
(ii) testing the method on PSI-BLAST profiles. Cuff and
Barton examined in detail how different alignment methods
improve (6). However, which fraction of the improvement
results from the mere growth of the database, which
fraction results from using more diverged profiles, and
which fraction results from training on larger profiles?
Using PHD from 1994 to separate the effects (8), we first
compared a noniterative standard BLAST (53) search against
SWISS-PROT (54) with one against SWISS-PROT + TrEMBL (54)
+ PDB (55). The larger database improves performance by
about two percentage points (8). Second, we compared the
standard BLAST against the large database with an
iterative PSI-BLAST search. This yielded less than two
percentage points in additional improvement (8). Thus,
overall, the more divergent profile search against today's
databases supposedly improves any method using alignment
information by almost four percentage points (PHDpsi in
Table I). The improvement gained by using PSI-BLAST
profiles to develop the method is relatively small: [...]
on a small database of not very [...] 1994; e.g., PROF was
trained on PSI-BLAST profiles of a 20 times larger
database in 2000. The two differ by only one percentage
point (Table I), and part of
TABLE I

Accuracy of Secondary Structure Prediction Methods^a

Method^b     Acc.^c   Acc. publ.^d   Segment^e   Info^f   Corr helix^g   Corr strand^h   Corr other^i   Class^j   BAD^k
PROF         77.0                    73          0.37     0.67           0.65            0.56           82        2.2
PSIPRED      76.6     76.5-78.3^m    73          0.37     0.66           0.64            0.56           81        2.5
SSpro        76.3     76             71          0.36     0.67           0.64            0.56           83        2.5
JPred2       75.2     76.4           70          0.34     0.65           0.63            0.64           77        2.4
PHDpsi       75.1                    70          0.29     0.64           0.62            0.53           80        2.9
PHD          71.9     71.6           68          0.25     0.59           0.59            0.49           77        4.1

Copenhagen   78^n     77.8
Wang/Yuan                                                                                               53^o

a Data set and sorting: The results are compiled by EVA (58). All methods for which details are listed have been tested on 195 different
new protein structures (EVA version February 2001). None of these proteins was similar to any protein used to develop the respective
method. This set comprised the largest such set by February 1, 2001, for which we had results. Sorting and grouping reflect the following
concept: if the data set is too small to distinguish between two methods, these two are grouped. For the given set of 195 proteins, this
yielded three groups. Inside of each group, results are sorted alphabetically. Due to a lack of data, I could not add the performance of
SAM-T99sec (48); on a set of 105 proteins SAM-T99sec appears comparable to the best three methods: PSIPRED, SSpro, and PROF. The
results from the Copenhagen method are set apart, since they were not collected continuously by EVA (the method is not publicly
available); rather they were provided by the group in Denmark for this review and thus may have been based on marginally differing
sequence databases.

b See abbreviations footnote in text; Copenhagen refers to the method from the group in Denmark (63); Wang/Yuan refers to a method
predicting secondary structural class from the amino acid composition, which may be the most accurate such method (59).

c Three-state per-residue accuracy, i.e., number of residues predicted correctly in one of the three states, helix, strand, or other
(conversion of DSSP states (H, G) → helix, (E, B) → strand; note that the per-residue accuracy tends to favour methods overpredicting
nonregular structure).

d Three-state per-residue accuracy published in original publication of method: PSIPRED (11), SSpro (13), JPred2 (6), PHD (122).

e Three-state per-segment score measuring the overlap between predicted and observed segments (75, 123).

f Per-residue information content (22).

g Matthews correlation coefficient for state helix (124).

h Matthews correlation coefficient for state strand (124).

i Matthews correlation coefficient for state other (124).

j Percentage of proteins correctly sorted into one of the four classes: all-alpha (length > 60, helix >45%, strand <5%), all-beta (length >
60, helix <5%, strand >45%), alpha/beta (length > 60, helix >30%, strand >20%), other (thresholds for classification from (122, 125, 126)).

k Percentage of helical residues predicted as strand and of strand residues predicted as helix (127).

m PSIPRED results were published for different conversions of the eight DSSP states to three states.

n P.

o The class accuracy for the method based on amino acid composition is taken from the original publication (59), i.e., based on a different
data set than all other methods.



this difference resulted from implementing new concepts
into PROF (Rost, unpublished; 9).

CAUTION: OVEROPTIMISM HAS BECOME EVEN MORE LIKELY!*

*Note: I added this section listing "what not to do"
primarily for developers of methods, since many of the
recently published methods fall prey to one of the
problems mentioned.

Seemingly improving accuracy by ignoring short segments.
There are many ways to publish higher levels of accuracy.
Among the simplest for secondary structure prediction is
to convert 3₁₀ helices and β bulges assigned by DSSP (2)
to nonregular structure. This yields higher levels of
accuracy since all methods are, on average, better at
predicting the middle of helices and strands than their
caps and hence are more accurate for longer regular
secondary structure segments (56, 57). When predicted
secondary structure is used to predict 3D structure,
short helices are important. Thus, I suggest bearing
with the more conservative conversion strategy.
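
The effect of the conversion can be made concrete with a small sketch. It assumes that the DSSP letters G and B stand for the two short states discussed above and that every state not listed is treated as other (L); the mapping names, the helper functions, and the toy strings are invented for illustration.

```python
# Two ways of reducing the eight DSSP states to three (H, E, L).
# Assumption: any DSSP letter missing from a mapping becomes L (other).
CONSERVATIVE = {"H": "H", "G": "H", "E": "E", "B": "E"}  # keeps short segments
LENIENT = {"H": "H", "E": "E"}  # G and B are counted as nonregular


def reduce_dssp(dssp: str, mapping: dict) -> str:
    """Translate a DSSP string into the three-state alphabet."""
    return "".join(mapping.get(state, "L") for state in dssp)


def q3(observed: str, predicted: str) -> float:
    """Three-state per-residue accuracy in percent."""
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


# Invented example: the prediction misses a short G segment and a single B.
dssp = "TTGGGSSHHHHHHTTBSEEEEE"
pred = "LLLLLLLHHHHHHLLLLEEEEE"

conservative_q3 = q3(reduce_dssp(dssp, CONSERVATIVE), pred)
lenient_q3 = q3(reduce_dssp(dssp, LENIENT), pred)
print(round(conservative_q3, 1), lenient_q3)
```

The same prediction scores higher under the lenient conversion simply because the short segments it misses are no longer counted as regular structure.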

Comparing apples and oranges or too few apples with one
another. To overstate the point: there is NO value in
comparing methods evaluated on different data sets. Most
secondary structure prediction methods are available.
Thus, developers may want to compare their results to
public methods based on the same data set (not previously
used for either of the two). Many methods predicting
aspects of protein structure and function must fight with
limited data availability. This is not at all the case
for secondary structure prediction. Hundreds of new
protein structures are added every year (55). If for some
reason or another a small data set is used, developers
should painstakingly try to estimate what "significant
difference" means for their data set. For example, 16 new
protein structures are clearly too few! We currently have
results from many prediction methods for 16 proteins. For
that set, JPred2, PHD, PROF, PSIPRED, SAM-T99sec, and
SSpro are indistinguishable (58)!
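
A rough back-of-the-envelope sketch shows why 16 proteins cannot separate the methods. It assumes that per-protein accuracies scatter with a standard deviation of about 10 percentage points (the order of magnitude quoted in the caption of Fig. 2) and that the proteins are independent; the function name is invented.

```python
import math


def standard_error(sigma: float, n_proteins: int) -> float:
    """Standard error of the mean accuracy over n independent proteins."""
    return sigma / math.sqrt(n_proteins)


SIGMA = 10.0  # assumed per-protein standard deviation, in percentage points

for n in (16, 195):
    print(n, round(standard_error(SIGMA, n), 2))
```

Under these assumptions, the mean over 16 proteins is uncertain by about 2.5 percentage points, which is as large as the differences between most methods in Table I; for 195 proteins the uncertainty shrinks to about 0.7 percentage points.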

Seemingly achieve 100% accuracy by using corre- 
lated sets. Many publications on predicting second- 
ary structural class from amino acid composition 
allowed correlations between "training" and testing 
sets. Consequently, levels of prediction accuracy 
published far exceeded the possible theoretical mar- 
gins (59). A very simple operational definition for 
"independent sets" is the following: Two proteins A 
and B are correlated if the sequence similarity be- 
tween A and B suffices to predict the structure of B 
knowing A's structure. Assume we have two uncorrelated
sets of proteins S1 and S2. Can we train the
method on set S1 and develop it on set S2 without
further ado? While developing PROF, I realized that 
the answer is negative. In fact, I trained neural 
networks on about 2000 structures that had no sig- 
nificant level of sequence similarity to our original 
set of 126 proteins (22). I used the 126 proteins only 
after I had completed developing the method and 
found a prediction accuracy exceeding 80% (unpub- 
lished). When I tested PROF on a set of about 200 
new structures that had been added to PDB in the 
meantime (different from that given in Table I), 
prediction accuracy dropped. Do the 126 proteins 
differ from the set used for Table I? I failed to an- 
swer this question. Conclusion: test as test can; i.e., 
use as many independent sets of new structures as 
possible! 

EVA: Automatic evaluation of automatic predic- 
tion servers. In collaboration with Volker Eyrich 
(Columbia), Marc Marti-Renom and Andrej Sali 
(both from Rockefeller), and Florencio Pazos and 
Alfonso Valencia (both from CNB Madrid), we have 
started to address the above problems through the 
automatic server EVA (58). Leszek Rychlewski (IIMCB
Warsaw) and Dani Fischer (Ben-Gurion University) are
implementing similar ideas in LiveBench (60). The simple
concept is the following: Take the N newest experimental
structures added to PDB, send the sequences to all
prediction servers, collect the results, and accumulate a
continuous evaluation of prediction accuracy every week.
EVA has been evaluating secondary structure prediction
methods for more than 6 months now. I found it
instructive to see how the "ranking" of methods initially
changed from week to week due to too small sets.
Currently, EVA also provides results for evaluating
comparative modeling (Sali group) and residue-residue
contacts (Valencia group). We hope that EVA will
eventually simplify life for developers, referees,
editors, and users.







CLEVER METHODS CAN BE MORE ACCURATE

SSpro: Advanced recursive neural network system. 
The only method published recently that appears to 
improve prediction accuracy significantly not 
through more divergent profiles but through the 
particular algorithm is SSpro (13). The major idea of 
the method aims at solving the following problem. 
When, e.g., training neural networks it is important to
avoid correlations between training samples presented
successively to the system. A neural network may be
presented with the window around residue 11 in protein X
at time step T and residue 7 in protein Y at step T + 1.
Thus, the system never learns that secondary structure
correlates between adjacent residues. The result is that
regular secondary structure segments are predicted, on
average, at a length half that observed (19). PHD
addressed this problem by a second-level
structure-to-structure network that was trained on the
predicted secondary structure from the first-level
sequence-to-structure network (22). Most authors have
since implemented this idea (in particular PSIPRED and
JPred2). Pierre Baldi and colleagues deviated
substantially from this concept. Instead of using an
additional network, they embedded the correlation 
into one single recursive neural network. In princi- 
ple, the idea of a recursive network had been imple- 
mented before (61). However, the particular details 
of the algorithm implemented in SSpro are novel 
and — as Table I illustrates — prove highly success- 
ful. 
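
The shortening of predicted segments described above is easy to measure. The following sketch computes the mean length of helix and strand segments in a three-state string; the helper name, the H/E/L encoding, and the toy strings are assumptions chosen for illustration.

```python
from itertools import groupby


def mean_segment_length(states: str, regular: str = "HE") -> float:
    """Mean length of consecutive runs of regular secondary structure."""
    lengths = [
        len(list(run)) for state, run in groupby(states) if state in regular
    ]
    return sum(lengths) / len(lengths) if lengths else 0.0


# Invented example: the prediction breaks each observed segment in two.
observed = "LLHHHHHHHHLLEEEEEELL"
predicted = "LLHHHHLHHHLLEEELEELL"
print(mean_segment_length(observed), mean_segment_length(predicted))
```

A per-residue score hardly penalises the two broken segments in this example, although the predicted segments are on average less than half as long as the observed ones.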

HMMSTR: Hidden Markov models for connecting 
library of structure fragments. Can we predict secondary
structure for protein U by local sequence similarity to
segments of known structures {S} even when overall U
differs from any of the known structures {S}? Yes, as
shown by many nearest-neighbor-based prediction methods,
the most successful of which seems to be NSSP (62). A
conceptually quite different realisation of the same
concept has been implemented in HMMSTR by Chris Bystroff,
David Baker, and colleagues (3). First, build a library
of local stretches (3-19) of residues with "basic
structural motifs" (I-sites). Second, assemble these
local motifs through hidden Markov models introducing
structural context on the level of supersecondary
structure. Thus, the goal is to predict protein structure
through identification of "grammatical units of protein
structure formation." Although HMMSTR intrinsically aims
at predicting higher order aspects of 3D structure, a
side result is the prediction of secondary structure. I
find two results [...]: (i) The authors do not find any
significant effect of "overoptimizing" their method;
i.e., HMMSTR appears as accurate in predicting secondary
structure for proteins known today as it will be for
those known next year. (ii) Three-state per-residue
accuracy is reported to be about 74% (3). If this estimate
is correct, HMMSTR is more accurate at predicting
secondary structure than most existing methods and 
almost as accurate as the state-of-the-art methods 
(Table I). 

And the winner is? The reason for the particular 
focus of this review on a small number of methods is 
largely that I could compare the selected methods to 
one another based on new proteins. A particular 
method that was not available to me may turn out to 
mark the most substantial breakthrough in the 
field. A Danish group developed a neural network-based
method that is most amazing in many respects (63). (i)
The authors estimate the method to yield levels above 77%
prediction accuracy (the title of their article is
slightly misleading). If true, this is the best current
method. Like PSIPRED, JPred2, and PROF, the method uses
PSI-BLAST profiles as input and like most methods since
PHD a two-level approach addressing the problem of
predicting short segments. (ii) A concept that had not
been published before is to replace the standard three
output units (for helix, strand, and other), by nine
output units additionally coding for the secondary
structure states of the residues before and after the
central one (dubbed "output expansion"). (iii) Also new
is the particular way of weighting the average over
different networks by the overall reliability of the
prediction for that network and the mere number of
different networks considered (up to 800!). This
impressive number of networks may prevent large-scale
genome analyses based on this method. However, the major
point is: Did the authors overestimate performance? The
authors tested their method in a way that most developers
would assume to be error-proof. However, their testing
protocol is very similar to the one that I applied when
significantly overestimating the accuracy of PROF (>81%).
Obviously, the similarity of these two situations may
very well be purely coincidental! 
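
To make the idea of "output expansion" concrete, the following sketch builds nine-unit training targets from an observed three-state string: three units each for the residue before, the central residue, and the residue after. This is only an illustration of the concept as described above, not the Danish group's implementation; in particular, leaving all three units at zero beyond the chain ends is an assumption, as are the function names.

```python
STATES = "HEL"  # helix, strand, other


def one_hot(state: str) -> list:
    """Three output units with a single unit set for the given state."""
    return [1 if state == s else 0 for s in STATES]


def expanded_targets(structure: str) -> list:
    """Nine target units per residue: states at positions i-1, i, i+1."""
    targets = []
    for i in range(len(structure)):
        row = []
        for j in (i - 1, i, i + 1):
            if 0 <= j < len(structure):
                row += one_hot(structure[j])
            else:
                row += [0, 0, 0]  # assumption: no state beyond the chain ends
        targets.append(row)
    return targets


for row in expanded_targets("HEL"):
    print(row)
```

A standard three-unit network would see only the middle triple of each row.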

Plethora of new concepts for secondary structure 
prediction. The following five methods are a small subset
of new ideas proposed to improve secondary structure
prediction. (i) Ouali and King (64) combine neural
networks and rule-based statistics in a cascade of
classifiers. Based on a similar data set they estimate a
level of prediction accuracy comparable to that of JPred2
(see Table I). (ii) Chandonia and Karplus (57) combined
simple [...] (two output states) with networks trained on
different tasks and a particular variant of early
stopping; input is nondivergent alignments picked from
the safe zone (Fig. 1). Based on a protocol similar to
that applied by the Danish group (63), the authors
estimate a level of >76% accuracy, i.e., a level that if
it holds up is similar to SSpro (Table I). (iii)
Supposedly the simplest new method that claims to almost
approach the performance of PHD combines the information
for secondary structure formation contained in amino acid
singlets, doublets, and triplets. (iv) Schmidler et al.
(65) use a simple statistical model; the novel aspect is
to replace compiling statistics over fixed stretches of N
residues by segments signifying regular secondary
structure (helix, strand). The underlying formalism
resembles a hidden semi-Markov model allowing one to
explicitly incorporate particular propensities such as
helix caps (66). Based on noncomparable data sets the
authors estimated prediction accuracy to be 69%; if
correct, this is impressive for a method not using
alignment information. (v) Without claims to surprising
levels of accuracy, Figureau et al. (67) combine cleverly
chosen pentapeptides from the database to obtain the
final prediction.

Secondary structural class predicted almost as ac- 
curately as by experiment. Grouping proteins into 
secondary structure classes (all-alpha, all-beta, al- 
pha/beta, and other) appears to be a useful initial 
approach for classifying proteins (27, 68). Surpris- 
ingly, such classes can be predicted successfully 
based merely on the overall amino acid composition 
of a protein (59, 69, 70). More and more increasingly 
complex and genial methods address this reduced 
goal; reported levels of prediction accuracy approach 
100%. Recently, Wang and Yuan explained these 
high values by insufficient testing schemes and chal- 
lenged that a four-state accuracy of 60% comprises 
the maximum for methods based solely on composi- 
tion (59). Obviously, it is much easier to predict class 
starting from the detailed information about evolu- 
tionary profiles for the entire sequence than by re- 
stricting the input to composition. In fact, the best 
current methods also improve the accuracy in pre- 
dicting secondary structure class considerably (Ta- 
ble I). The differences between observed and pre- 
dicted composition of secondary structure are now 
below 6% for helix and strand. This is fairly close to 
what experimental low-resolution (circular dichroism,
Fourier transform infrared spectroscopy) methods achieve
at their best (57).
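
Assigning the structural class from the predicted content is a matter of a few thresholds. The sketch below applies the thresholds listed in the footnotes to Table I to a three-state string; the function name and the H/E/L encoding are assumptions, and the order in which the rules are tested is my own choice.

```python
def structural_class(states: str) -> str:
    """Sort a protein into one of four classes from its H/E/L string."""
    length = len(states)
    if length <= 60:  # all three named classes require length > 60
        return "other"
    helix = 100.0 * states.count("H") / length
    strand = 100.0 * states.count("E") / length
    if helix > 45 and strand < 5:
        return "all-alpha"
    if helix < 5 and strand > 45:
        return "all-beta"
    if helix > 30 and strand > 20:
        return "alpha/beta"
    return "other"


# Invented compositions for illustration.
print(structural_class("H" * 50 + "L" * 50))
print(structural_class("H" * 35 + "E" * 25 + "L" * 40))
```

Applying the same function to the observed and to the predicted string of each protein and counting agreements yields a class score of the kind reported in Table I.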

COMBINING MEDIOCRE AND GOOD METHODS MAY BE BEST

Combination improves on nonsystematic errors. 
Any prediction method has two sources of errors: (i) 
systematic errors, e.g., through nonlocal effects, and 
(ii) white noise errors caused by, e.g., the succession
of the examples during training neural networks.
Theoretically, combining any number of methods improves
accuracy as long as the errors of the individual methods
are mutually independent and are not only systematic
(71). PHD, and more recently other methods (6, 57, 63),
used this fact in combining different neural networks.
The idea of combining different prediction methods has
been around in secondary structure prediction for a long
time (19); Cuff and Barton (see 4, 5) implemented it in
JPred for different third-generation methods. In
particular, JPred uses a simple expert rule for compiling
the final average. King et al. (72) have tested a variety
of different combination strategies. Selbig et al. (73)
have compiled the jury through an elaborated
decision-tree-based system. Guermeur et al. (74) have
used a more refined variant of the JPred idea of
weighting methods. Overall, combinations of independent
prediction methods seem to yield levels of accuracy
higher than that of the single best method. However, for
every protein one method tends to be clearly superior to
the combined prediction (Fig. 2B). Is it really wise to
include significantly inferior methods into a combined
prediction? No: averaging over all methods used for EVA
decreased accuracy over the best individual methods,
although averaging over the better ones was better than
averaging the best ones (Rost, unpublished results). Is
there any criterion for when to include a method and when
not to do so? Concepts weighting the individual methods
based on their accuracy and "entropy" (63) appear
successful only for large numbers of methods (63; Rost,
unpublished results).
Nevertheless, methods that are significantly over- 
trained can improve when combined (Krogh, unpub- 
lished results). More rigorous studies for the optimal 
combination may provide a better picture. The tech- 
nical problem of utilizing many methods in a public 
server is that the field is advancing too fast: today's 
methods are more accurate than averages over yes- 
terday's methods (hence the JPred server now re- 
turns JPred2 results by default). 
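
The simplest form of a combined prediction is a weighted average over the per-residue outputs of several methods. The sketch below is such an average and nothing more; it is not the expert rule used by JPred nor any of the published jury schemes, and the function name, the input layout, and the numbers are invented.

```python
STATES = "HEL"  # helix, strand, other


def combine(predictions: list, weights: list = None) -> str:
    """Weighted average of per-residue state scores from several methods.

    predictions: one entry per method; each entry is a list with one
    [helix, strand, other] score triple per residue.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    total = sum(weights)
    consensus = []
    for i in range(len(predictions[0])):
        averaged = [
            sum(w * p[i][k] for w, p in zip(weights, predictions)) / total
            for k in range(3)
        ]
        consensus.append(STATES[averaged.index(max(averaged))])
    return "".join(consensus)


# Invented scores for two residues from three hypothetical methods.
method_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
method_b = [[0.3, 0.6, 0.1], [0.1, 0.2, 0.7]]
method_c = [[0.7, 0.2, 0.1], [0.2, 0.6, 0.2]]

print(combine([method_a, method_b, method_c]))
print(combine([method_a, method_b, method_c], weights=[1.0, 5.0, 1.0]))
```

Giving one method a large weight flips the consensus for both residues in this example, which illustrates why the choice of weights matters.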

WHAT DOES 76% ACCURACY MEAN, IN PRACTICE? 

Your protein may be predicted worse or better than 
average. A few problems in estimating expected 
prediction accuracy are described above. However, 
another problem is relevant for users of prediction 
methods: A sustained level of 76% accuracy does 
NOT mean that 76% of the residues in your protein 
of unknown structure U are correctly predicted. In 
contrast, prediction accuracy varies substantially 
between proteins (Fig. 2A). It seems that such variations
are intrinsic to any method predicting aspects of protein
structure and function. What can you then expect as
accuracy for your protein when using a state-of-the-art
method? Given a divergent family (Table II), the answer
is 66-86%. Do you learn from
comparing different methods? 

Combining methods improves on average but you 
may also lose. Averaging over many methods 
helps, on average. However, most often some methods
are more accurate than the average (Fig. 2B).
Furthermore, there are examples of proteins pre- 
dicted poorly by all methods (Fig. 2B), i.e., for which 
all methods agree by mistake (data not shown). 
Thus, trying to use many methods may not provide 
the answer to the question whether the prediction 
for your protein is more likely to be below or above 
average. Are there alternative ways to spot more 
reliably predicted regions? 

More reliable predictions are more accurate. Re- 
liability indices as provided by most methods corre- 
late very well with prediction accuracy (Fig. 3). This 
implies that you can easily identify regions that are 
more likely to be predicted accurately than others. 
Furthermore, if your protein has many residues pre- 
dicted at low levels of reliability, you may correctly 
suspect that your protein is predicted at a level 
below average. Plotting coverage versus accuracy 
(Fig. 3) also illustrates how beneficial more diver- 
gent profiles are to make predictions more useful. 
For example, PSIPRED has more than half of all
residues predicted at levels that would be reached 
on average when comparing two known structures 
(75) (Fig. 3, dotted line). 
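
A coverage-versus-accuracy curve of the kind shown in Fig. 3 can be computed as in the following sketch. It assumes one integer reliability index and one correct/incorrect flag per residue; the function name and the toy data are invented.

```python
def coverage_vs_accuracy(reliability: list, correct: list) -> list:
    """Accuracy for all residues predicted at or above each reliability.

    Returns (threshold, coverage in %, accuracy in %) tuples, starting
    with the most reliable residues.
    """
    rows = []
    for threshold in sorted(set(reliability), reverse=True):
        kept = [c for r, c in zip(reliability, correct) if r >= threshold]
        coverage = 100.0 * len(kept) / len(reliability)
        accuracy = 100.0 * sum(kept) / len(kept)
        rows.append((threshold, coverage, accuracy))
    return rows


# Invented example: eight residues with reliability indices from 1 to 9.
reliability = [9, 9, 7, 7, 5, 3, 3, 1]
correct = [True, True, True, True, False, True, False, False]

for row in coverage_vs_accuracy(reliability, correct):
    print(row)
```

Reading the table from top to bottom, accuracy drops as less reliably predicted residues are included, which is the behaviour to expect from a useful reliability index.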

ARE SECONDARY STRUCTURE PREDICTIONS 
USEFUL, IN PRACTICE? 

Regions likely to undergo structural change predicted
successfully. Young et al. (1) have unraveled an
impressive correlation between local secondary structure
predictions and global conditions. The authors monitor
regions for which secondary structure prediction methods
give equally strong preferences for two different states.
Such regions are processed combining simple statistics
and expert rules. The final method is tested on 16
proteins known to undergo structural rearrangements and
on a number of other proteins. The authors report no
false positives and identify most known structural
switches. Subsequently, the group applied the method to
the myosin family, identifying putative switching regions
that were not known before, but appeared to be reasonable
candidates (76). I find this method most remarkable in
two ways: (i) it is the most general method using
predictions of protein structure to predict some aspects
of function and (ii) it illustrates that predictions may
be useful even when structures are known (as in the case
of the myosin family).
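
The first step of such an analysis, spotting stretches with nearly equal preferences for two states, might look like the sketch below. It is not the method of Young et al., which adds statistics and expert rules on top; the function name, the margin, and the minimal region length are hypothetical choices.

```python
def ambiguous_regions(scores: list, margin: float = 0.1, min_len: int = 3) -> list:
    """Find runs of residues whose two strongest state scores are close.

    scores: one [helix, strand, other] triple per residue.
    Returns (start, end) index pairs with end exclusive.
    """
    regions = []
    start = None
    for i, triple in enumerate(scores):
        top, second = sorted(triple, reverse=True)[:2]
        if top - second < margin:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                regions.append((start, i))
            start = None
    if start is not None and len(scores) - start >= min_len:
        regions.append((start, len(scores)))
    return regions


# Invented scores: residues 2-5 are torn between helix and strand.
scores = [[0.8, 0.1, 0.1]] * 2 + [[0.45, 0.45, 0.1]] * 4 + [[0.1, 0.8, 0.1]] * 2
print(ambiguous_regions(scores))
```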
FIG. 2. Prediction accuracy varies substantially for different proteins. All results are based on 150 novel protein structures not used
to develop any of the methods shown (68). The considerable difference in the three-state accuracy between different proteins is valid for
all methods (A, percentage of all 150 proteins predicted at a given level of accuracy; one standard deviation is on the order of 10 percentage
points). On average, different methods predict different proteins at higher levels (B, for each protein and each method, the difference
between the per-protein average over all six methods is shown; negative values imply that the respective method is better than the
average). Conclusions: (i) If you predict secondary structure for your protein with a method of 76% accuracy, the actual accuracy for that
protein may be anywhere between 50 and 90%. (ii) As to be expected: most often some methods are more accurate than the average over
many methods.



Classifying proteins based on secondary structure 
predictions in the context of genome analysis. Proteins
can be classified into families based on predicted and
observed secondary structure (27, 68). However, such
procedures have been limited to a very coarse-grained
grouping only exceptionally useful for inferring function
(Table II). Nevertheless, in particular, predictions of
membrane helices and coiled-coil regions are crucial for
genome analysis. Recently, we came across an observation
that may have important implications for structural
genomics, in particular: More than one-fifth of all
eukaryotic proteins appeared to have regions longer than
60 residues apparently lacking any regular secondary
structure (77). Most of these regions were not of low
complexity, i.e., not composition-biased. Surprisingly,
these regions appeared evolutionarily as conserved as
[...] respective proteins. This application of secondary
structure prediction may aid in classifying proteins, in
separating domains, and possibly even in identifying
particular functional motifs.



Aspects of protein function predicted based on ex- 
pert analysis of secondary structure. The typical 
scenario in which secondary structure predictions 
facilitate learning about function is one in which 
experts combine their predictions and their intuition,
most often to find similarities to proteins of known
function but insignificant sequence similarity
(39, 78-89). Usually, such applications are based
on very specific details about predicted secondary 
structure (some examples are shown in Table II). 
Thus, these successful correlations of secondary 
structure and function appear difficult to incorpo- 
rate into automatic methods. 

Exploring secondary structure predictions to improve
database searches. Initially, three groups independently
applied secondary structure predictions for fold
recognition, i.e., the detection of structural
similarities between proteins of unrelated sequences
(90-92). A few years later, almost every other fold
recognition/threading method has adopted this concept
(98-102). Two recent methods
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TABLE n 

Using Secondary Structure Predictions, in Practice 



How to obtain toe beet results? 



Identify membrane proteins? 



Classify through coiled-coil regions? 



Classify through secondary structure content? 



Identify domains or structural regions? 



Monitor influences of point mutations? 



Find hi riding sites or motifs? 



Infer nmctional/stxnctural similarity? 



The major source of improvement is the divergence of the multiple sequence 
alignment used for prediction. Thus, if you have a small famil y, the 
expected prediction accuracy is lower. 

Particularly sensitive to divergence are the reliability indices; i.e., less 
divergence yields overestimated reliability indices. 

The most successful strategy to rind the most reliably predicted regions may 
be to use the reliability index provided by a method rather than the 
agreement between different methods. 

If you know there are nonglobular or structural domains in your protein, 
chop it up before you build the alignment. 

If you can improve the alignment, try to do so before the prediction. 

Predicted membrane helices indicate that your protein is not globular. The 
accurate membrane predictions are usually more reliable than those for 
globular proteins. Thus, membrane helix predictions should be given 
preference. Globular methods often do not predict globular helices at 
positions of membrane helices; rather, often membrane helices are 
predicted as strand by mistakenly applied globular methods. In contrast, 
globular methods appear relatively more accurate for porin-like beta- 
strand membrane regions. 

Detection of membrane proteins has less than a 3% error rate for the best 
methods. Most helices are correctly predicted, yet the number of helices 
may nevertheless vary. Helix caps are clearly predicted inaccurately. Note 
that general methods predicting three-state secondary structure for 
globular proteins also predict caps less accurately. 

Predictions of long coiled-coil regions clearly indicate that your protein is 
locally nonglobular. Long coiled-coil proteins are likely to be structural 
proteins. Longer regions are predicted more accurately. 

Classifying proteins according to the secondary sUu ct ur e composition is 
helpful, but arbitrary. One hope may be to infer from the predi ct ed 
secondary structure content that a particular protein is not typical. 
However, this attempt fails, since known protein structures vary 
significantly between 10 and 90% of regular secondary structure (helix, 
strand). Thus, secondary structure composition does not help to predict 
globularity. 

If you see two separate secondary structure patterns, you may suspect that 
the protein has two structural domains. Ad extreme example is an N- 
terminal all-alpha region and a Oterminal all-beta region. 

If you have to cut your protein, stay more than two residues away from 
predicted helices and strands. 

Secondary structure prediction methods arc— on average — as accurate in 
predicting the overall content of secondary structure as are careful CD 
and PUR methods. However, such methods allow you to monitor in detail 
structural responses to mutations. Such changes are less likely to be 
reflected as accurately by prediction methods. 

Moot often, binding sites lie in nonregular secondary structure eleme nt s. 
For example, we have not predicted regular secondary structure for any 
of the known nuclear localization signals (128). 

Secondary structure predictions do not suffice to identify binding mottth, 
such as the sine-finger II motif. However, the combination of sequence 
motif and predicted secondary structure may be very helpful. 

If you know the function/structure of protein A and want to infer whether B 
shares this function/structure, a similarity in the local secondary 
structure may help you substantially. 



extended the concept by not only refining the data- 
base search, but by actually refining the quality of 
the alignment through an iterative procedure (50, 
103). A related strategy has been employed by Ng 
and the Henikoffs to improve predictions and 
alignments for membrane proteins (104). 



From 1D predictions to 2D and 3D structure. Are 
secondary structure predictions accurate enough to 
help predict higher order aspects of 3D structure 
automatically? For 2D (interresidue contacts) 
predictions, Baldi et al. (105) have recently improved 
the level of accuracy in predicting β-strand 
FIG. 3. Prediction accuracy correlates with reliability. The conclusion from Fig. 2A is that you have a poor idea of how well a method 
performs when applied to your protein of unknown structure. Fortunately, there is a way out of this dilemma: Most methods now provide 
an index measuring the reliability of the prediction for each residue. Shown is the accuracy versus the cumulative percentages of residues 
predicted at a given level of reliability (coverage vs accuracy). For example, PSIPRED and PROF reach a level above 88% for about 60% 
of all residues (dashed line). This particular line is chosen since secondary structure assignments by DSSP agree to about 88% for proteins 
of similar structure. Although JPred2 is only marginally less accurate than PSIPRED and PROF (Table I), it reaches this level of accuracy 
for less than half of all residues. Conclusions: (i) Reliability indices are extremely valuable to spot regions of more-likely-to-be-correct 
predictions. (ii) These indices also address the problem of variation: if many residues are predicted with high reliability, your protein is 
more likely to be predicted more accurately than average (Fig. 2A). 
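
A coverage-versus-accuracy curve of the kind described in this caption could be computed roughly as follows. This is a minimal sketch, not the procedure used by any of the named methods: it assumes three-state strings for the predicted and observed states, an integer reliability index per residue, and toy data invented for illustration. The function name `accuracy_vs_coverage` is hypothetical.

```python
def accuracy_vs_coverage(predicted, observed, reliability):
    """For each reliability threshold t, report the percentage of residues
    with reliability >= t (coverage) and the percentage of those residues
    that are predicted correctly (accuracy)."""
    if not (len(predicted) == len(observed) == len(reliability)):
        raise ValueError("all inputs must have the same length")
    n = len(predicted)
    rows = []
    for t in sorted(set(reliability), reverse=True):
        picked = [i for i in range(n) if reliability[i] >= t]
        correct = sum(predicted[i] == observed[i] for i in picked)
        coverage = 100.0 * len(picked) / n
        accuracy = 100.0 * correct / len(picked)
        rows.append((t, coverage, accuracy))
    return rows


# Toy example (invented data): H = helix, E = strand, L = other.
predicted = "HHHHLLEEEL"
observed = "HHHLLLEELL"
reliability = [9, 9, 8, 3, 7, 8, 9, 8, 2, 6]

rows = accuracy_vs_coverage(predicted, observed, reliability)
for t, coverage, accuracy in rows:
    print(f"reliability >= {t}: coverage {coverage:5.1f}%, accuracy {accuracy:5.1f}%")
```

In this toy case the two wrong residues carry the lowest reliability values, so accuracy falls only when the last residues are included, which is the behavior the caption describes.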



pairings over earlier work (106) by using another 
elaborate neural network system. For 3D predictions, 
the following list of five groups exemplifies 
that secondary structure predictions are now a popular 
first step toward predicting 3D structure. (i) 
Ortiz et al. (107) successfully use secondary structure 
predictions as one component of their 3D structure 
prediction method. (ii) Eyrich et al. (108, 109) 
minimize the energy of arranging predicted rigid 
secondary structure segments. (iii) Lomize et al. 
(110) also start from secondary structure segments. 
(iv) Chen et al. (111) suggest using secondary structure 
predictions to reduce the complexity of molecular 
dynamics simulations. (v) Levitt and co-workers 
(see 112, 113) combine secondary structure-based 
simplified representations with a particular lattice 
simulation attempting to enumerate all possible 
folds. 

AND WHAT IS THE LIMIT OF PREDICTION 
ACCURACY? 

88% is a limit, but shall we ever get close to 
it? Protein secondary structure formation is influenced 
by long-range interactions (45, 46, 114) and 
by the environment (1, 115). Consequently, 



stretches of up to 11 adjacent residues (dubbed chameleon 
after (114)) can be found in different secondary 
structure states (116-118). Implicitly, such nonlocal 
effects are contained in the exchange patterns 
of protein families. This is reflected by the fact that 
strand is predicted almost as accurately as helix 
(Table I), although sheets are stabilized by more 
nonlocal interactions than helices. Local profiles can 
even suffice to identify structural switches (1, 76). 
Surprisingly, we can find some traces of folding 
events in secondary structure predictions (119). 
Even more amazing is a study suggesting that align- 
ment-based methods achieve levels of accuracy for 
chameleon regions similar to those for all other re- 
gions (118). Secondary structure assignments may 
vary for two versions of the same structure. One 
reason is that protein structures are not rocks but 
dynamic objects with some regions being more mobile 
than others. Another reason is that any assignment 
method must choose particular thresholds 
(e.g., DSSP chooses a cutoff in the Coulomb energy 
of a hydrogen bond). Consequently, assignments differ 
by about 6-15 percentage points between different 
X-ray versions or different NMR models for the 
same protein (Andersen and Rost, unpublished results), 
and by about 12 percentage points between 
structural homologues (75). The latter number pro- 
vides the upper limit for secondary structure predic- 
tion of error-free comparative modeling. I doubt that 
ab initio predictions of secondary structure will ever 
become more accurate than that. Hence, I believe a 
value of around 88% constitutes an operational up- 
per limit for prediction accuracy. After the advances 
over the past 2 years we reached greater than 76% 
accuracy. Thus, we need to achieve another 12 percentage 
points (or even less). What is the major 
obstacle to reaching another 6 percentage points 
higher? The size of the experimental database as 
suggested (117)? I doubt this, since PHDpsi trained 
on only 200 proteins using PSI-BLAST input is almost 
as accurate as PSIPRED trained on 2000 proteins 
(Table I). Will the current explosion of sequences 
boost accuracy? In fact, current databases 
have less than 10 homologues for more than one 
third of the 150 proteins tested (Table I) and more 
than 100 for only 20% of the proteins. Although 
based on too small a set to draw conclusions, for 
these 20% highly populated families the accuracy of 
PROF was 4 percentage points above average (data 
not shown). Thus, larger databases may get us 6 
percentage points higher, or they may not. The answer 
remains nebulous. 
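
The percentage-point arithmetic in this section can be made concrete with a short sketch. It assumes that agreement between two assignments is measured as the per-residue percentage of identical three-state labels. The two strings are invented toy data, not real X-ray or NMR assignments, and `percent_agreement` is a hypothetical helper name. The 88% and 76% figures are the ones quoted in the text.

```python
def percent_agreement(a, b):
    """Percentage of positions at which two equally long three-state
    assignment strings carry the same label."""
    if len(a) != len(b):
        raise ValueError("assignments must have the same length")
    if not a:
        raise ValueError("assignments must not be empty")
    same = sum(x == y for x, y in zip(a, b))
    return 100.0 * same / len(a)


# Two hypothetical assignments for the same 21-residue protein.
xray = "LHHHHHHLLEEEEELLLHHHL"
nmr = "LLHHHHHLLLEEEELLLHHLL"

agreement = percent_agreement(xray, nmr)
print(f"agreement: {agreement:.1f}%, difference: {100.0 - agreement:.1f} points")

# Headroom implied by the figures quoted in the text.
operational_limit = 88.0  # assumed upper limit for prediction accuracy
current_accuracy = 76.0   # level reached by the best current methods
headroom = operational_limit - current_accuracy
print(f"remaining headroom: {headroom:.0f} percentage points")
```

The toy strings differ mainly at the ends of segments, which is where real assignments of the same protein also tend to disagree.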

DISCUSSION 

Methods improved significantly over the past 2 
years. Growing databases and improved search 
techniques (Fig. 1), predominantly through the iterated 
PSI-BLAST tool, yielded a substantial improvement 
in secondary structure prediction accuracy 
over the past 2 years. State-of-the-art methods 
now reach sustained levels of 76% prediction accuracy 
(Table I). Even more impressively, about 60% of 
all residues are predicted at levels reaching the level 
of agreement between X-ray and NMR structures 
(Fig. 3). However, novel ideas have also been shown 
to improve prediction accuracy. A standard way to 
increase the confidence in a particular prediction is 
to look at the results from many different prediction 
methods. This strategy is frequently successful and 
has been brought to perfection over recent years. 
However, often the best method is better than the 
average over many methods (Fig. 2B). While structure 
prediction is coming of age, developers and users 
slowly learn to reduce overestimations. However, 
the correlations between proteins at times of 
database explosions are becoming more difficult to 
control. It seems that only continuous, automatic 
evaluation servers will be able to handle this challenge 
in the future (58, 60). 
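
Looking at the results from many different prediction methods can be illustrated with a simple per-residue majority vote. This is one possible way to combine predictions, not the procedure of any particular server, and it assumes equal-length three-state strings from each method. The name `consensus` and the example strings are hypothetical.

```python
from collections import Counter


def consensus(predictions):
    """Combine several three-state predictions by per-residue majority vote.

    Returns the consensus string and, for each residue, the fraction of
    methods that voted for the winning state. Ties go to the state that
    appears first among the methods at that position."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("need at least one prediction, all of equal length")
    states, support = [], []
    for column in zip(*predictions):
        state, votes = Counter(column).most_common(1)[0]
        states.append(state)
        support.append(votes / len(predictions))
    return "".join(states), support


# Three invented predictions for the same 10-residue sequence.
predictions = ["HHHHLLEEEL", "HHHLLLEEEL", "LHHHLLLEEL"]
combined, support = consensus(predictions)
print(combined)
print([round(s, 2) for s in support])
```

The `support` values play a role similar to a reliability index: positions where the methods disagree deserve less trust. As noted above, the best single method often beats the average over many methods, so a vote like this is not guaranteed to improve on it.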



Secondary structure predictions are at the base of 
structure-based sequence analysis. Almost a de- 
cade after the original breakthrough, prediction 
methods are now increasingly explored by wet-lab 
biologists to analyze their protein of interest. Sec- 
ondary structure predictions are used automatically 
by methods aiming at higher dimensional aspects of 
protein structure and at improving database 
searches and alignment accuracy. One method has 
successfully related secondary structure predictions 
automatically to functional aspects (1, 76). However, 
secondary structure-based identifications of binding 
sites or other functional aspects are still restricted to 
single-case expert analyses. 

And now we run human? The field has advanced 
considerably over the past 2 years, and more improvement 
appears to lie ahead. Prediction methods 
are fast enough to analyze entire genomes, and for 
particular examples the resulting classifications are 
relevant to structural and functional genomics (28, 
68). Nevertheless, to play the devil's advocate: The 
field is not up to the challenge of the human sequences 
to be dumped into the database very soon. 
We are missing a variety of approaches relating 
secondary structure predictions explicitly to function, 
such as given by ASP (1). Obviously, this remark 
may apply to bioinformatics in general: The 
year 2001 will commence with the publication of the 
entire human genome; we must rush to get ready for 
the data flood. 

Thanks are extended to Jinfeng Liu (Columbia University) for 
computer assistance and the collection of genome data sets; to 
Jinfeng Liu and Dariusz Przybylski (Columbia University) for 
providing preliminary information and programs; and to Claus 
Andersen and Søren Brunak (CBS Copenhagen) for helpful comments 
on the manuscript. Particular thanks are due to Volker 
Eyrich (Columbia University) for programming and maintaining 
most of the immensely valuable software that runs the EVA and 
META-PredictProtein servers. 